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SUMMARY 

Stochastic  substitution,  the  Gibbs  sampler  and  the  sampling-importance-resampling  algorithm  can  be 
viewed  as  three  alternative  sampling,  or  Monte  Carlo,  based  approaches  to  the  calculation  of  numerical 
estimates  of  marginal  probability  distributions.  The  three  approaches  will  be  reviewed,  and  compared  and 
contrasted,  in  relation  to  various  joint  probability  structures  frequently  encountered  in  applications.  In  par¬ 
ticular,  the  relevance  of  the  approaches  to  calculating  Bayesian  posterior  densities  for  a  variety  of  struc¬ 
tured  models  will  be  discussed  and  illustrated. 
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sampling;  conditional  probability  structure;  posterior  distributions;  data  augmentation; 
hierarchical  models;  missing  data;  variance  components. 
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In  relation  to  a  collection  of  random  variables,  C/j ,  t/2 . suppose  that  either; 

(i)  for  i  =  the  conditional  distributions  j  *  i  are  available,  perhaps  having,  for  some  i. 

reduced  forms  £/,  |  C/, ,  y  e  S;  c  {1 . £),  or 

(ii)  the  functional  form  of  the  joint  density  of  C/j .  H/*  is  known,  perhaps  modulo  the  normalizing 

constant,  and  at  least  one  t/,  |  Uj ,  j  *  i  is  available, 

where  available  is  here  taken  to  mean  that  samples  of  L/,  can  be  straightforwardly  and  efficiently  generated, 
given  specified  values  of  the  appropriate  conditioning  variables. 

The  problem  addressed  in  this  paper  is  the  exploitation  of  the  kind  of  structural  information  given  by 
either  (i)  or  (ii)  in  order  to  obtain  numerical  estimates  of  non-analytically  available  marginal  densities  of 
some  or  all  of  the  t/,-  (when  possible)  simply  by  means  of  simulated  samples  from  available  conditional 
distributions,  and  without  recourse  to  sophisticated  numerical  analytic  methods.  No  claim  will  be  made  that 
the  sampling  methods  to  be  described  are  necessarily  computationally  efficient  compared  with  expert  use  of 
the  latter.  The  attraction  of  the  sampling  based  methods  is  instead  their  conceptual  simplicity  and  ease  of 
implementation  for  users  with  available  computing  resource  but  without  numerical  analytic  expertise.  All 
that  the  user  requires  is  insight  into  the  relevant  conditional  probability  structure,  together  with  techniques 
for  the  efficient  generation  of  appropriate  random  variates  as  described,  for  example,  by  Devroye  (1986) 
and  Ripley  (1987). 

In  Section  2,  we  discuss  and  extend  three  alternative  approaches  put  forward  in  the  literature  for  cal¬ 
culating  marginal  densities  via  sampling  algorithms.  These  are  (variants  of)  the  stochastic  substitution 
algorithm  described  by  Tanner  and  Wong  (1987),  the  Gibbs  sampler  algorithm  introduced  by  Geman  and 
Geman  (1984)  and  the  form  of  importance  sampling  algorithm  proposed  by  Rubin  (1987,  1988). 

We  note  that  the  Gibbs  sampler  has  been  widely  taken  up  in  the  image  processing  literature,  and  in 
other  large-scale  models  such  as  neural  networks  and  expert  systems,  but  that  its  general  potential  for  more 
conventional  statistical  problems  seems  to  have  been  overlooked.  As  we  shall  show  (and  as  has  also  been 
observed  by  Clayton,  1988),  there  is  a  close  relationship  between  the  Gibbs  sampler  and  the  substitution 
algorithm  proposed  by  Tanner  and  Wong  (1987).  We  shall  generalize  the  latter  and  show  that  it  is  as  least 
as  efficient  as  the  Gibbs  sampler,  and  potentially  more  efficient,  given  the  availability  of  distinct  conditional 
distributions  in  addition  to  those  in  (i)  above.  We  note  that,  as  a  consequence  of  the  relationship  between 
the  two  algorithms,  the  convergence  results  established  by  Geman  and  Geman  (1984)  are  applicable  to  the 
generalised  substitution  algorithm.  The  stronger  convergence  results  established  by  Tanner  and  Wong 
(1987)  require  the  availability  of  a  particular  set  of  conditional  distributions,  including  those  in  (i). 

Both  the  substitution  and  Gibbs  sampler  algorithms  are  iterative  Monte  Carlo  procedures,  applicable 
'•'hen  the  kind  of  structural  information  given  by  (i)  above  is  available.  When  the  structural  information  is 
of  the  kind  described  by  (ii),  we  shall  see  that  an  importance  sampling  algorithm  based  on  Rubin  (1987, 
1988)  provides  a  non-iterative,  Monte  Carlo  integration  approach  to  calculating  marginal  densities. 

In  Section  3,  we  illustrate  various  model  structures,  occurring  frequently  in  applications,  where  one  or 
more  of  these  three  approaches  offers  an  easily  implemented  solution.  In  particular,  we  consider  the  calcu¬ 
lation  of  Bayesian  posterior  distributions  in  incomplete  data  problems,  conjugate  hierarchical  models  and 
normal  data  models. 

In  Section  4,  we  briefly  summarize  the  results  of  some  preliminary  computational  experience  in  two 
simple  cases.  Detailed  applications  to  complex,  real  data  problems  will  be  presented  in  a  subsequent  paper. 


Finally,  in  Section  5,  we  provide  a  summary  discussion. 


2.  Sampling  approaches 

In  the  sequel,  we  shall  assume  that  we  are  dealing  with  real,  possibly  vector-valued,  random  variables 
having  a  joint  distribution  whose  density  function  is  strictly  positive  over  the  (product)  sample  space.  This 
ensures  that  knowledge  of  all  full  conditional  specifications  (such  as  in  (i)  of  Section  1)  uniquely  defines 
the  full  joint  density:  see,  for  example,  Besag  (1974).  Throughout,  we  shall  assume  the  existence  of  densi¬ 
ties,  with  respect  to  either  Lebesgue  or  counting  measure,  as  appropriate,  for  all  marginal  and  conditional 
distributions.  The  terms  distribution  and  density  will  therefore  be  used  interchangeably. 

Densities  will  be  denoted,  generically,  by  square  brackets,  so  that  joint,  conditional  and  marginal 
forms  appear,  for  example,  as  [X,Y],  [X|y]  and  [ X ].  Multiplication  of  densities  will  be  denoted  by  »  ,  so 
that,  for  example, 

[x.y]  = 

The  process  of  marginalisation  (i.e.  integration)  will  be  denoted  by  forms  such  as 

[X\Y]  =  /  [X\Y,Z,W)*[Z\WJ]*[W\Y], 

with  the  convention  that  all  variables  appearing  in  the  integrand  but  not  in  the  resulting  density  have  been 
integrated  out.  Thus,  in  the  above,  the  integration  is  with  respect  to  Z  and  W.  More  generally,  we  shall  use 
notation  such  as 

jh(Z,W)*[W] 

to  denote,  for  given  Z,  the  expectation  of  the  function  h(Z,  W)  with  respect  to  the  marginal  distribution  for 
W. 


2.1  Substitution  algorithm 

The  substitution  algorithm  for  finding  fixed  point  solutions  to  certain  classes  of  integral  equations  is  a 
standard  mathematical  tool,  which  has  received  considerable  attention  in  the  literature:  see,  for  example. 
Rail  (1969).  Its  potential  utility  in  statistical  problems  of  the  kind  we  are  concerned  with  in  this  paper  was 
recently  observed  by  Tanner  and  Wong  (1987)  and  associated  discussion.  Briefly  reviewing  the  essence  of 
their  development  using  the  notation  introduced  above,  we  have 

[xi  =  /[xin*m  a) 

[Y]  =/mxi*m,  (2) 

so  that,  substituting  (2)  into  (1),  we  obtain 

(X]  =  j  [X\Y]*  j  [K|X']*(*'J  =  //t(X,X')*[X'],  (3) 

where 

h(X.X')  =  /  [X|n*[T|X'], 

with  X'  appearing  as  a  “dummy  argument”  in  (3)  and,  of  course,  [X]  =  [X'].  Now  suppose  that,  on  the 
right-hand  side  of  (3),  [X'l  were  replaced  by  [X'],,  to  be  thought  of  as  an  estimate  of  [X]  s  (X'l  arising  at 
the  ith  stage  of  an  iterative  process.  Then,  (3)  implies  that,  for  some  (X]1+1, 

[X],-  +  i  =  /  h(X.X')  *  [X'l, 

=  MX),, 

in  a  notation  making  explicit  the  fact  that  /*  is  the  integral  operator  associated  with  h.  Exploiting  standard 
theory  of  such  integral  operators.  Tanner  and  Wong  (1987)  show  that,  under  mild  regularity  conditions,  this 
iterative  process  has  the  following  properties  (with  obviously  analogous  results  for  [T]). 


TW1  (uniqueness)  The  true  marginal  density,  (X],  is  the  unique  solution  to  (3). 
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TW2  (convergence)  For  any  [X]0.  the  sequence  [X ] | ,  [X ] 2 . defined  by  [X]1  +  1  =  /*, [X ] , ,  i  =  0,1,... 

converges  monotonically  in  L,  to  [X], 

TW3  (rate)  j  |  [X];-  [X ] )  ->  0  geometrically  in  i. 

Extending  the  substitution  algorithm  to  three  random  variables  X,  Y,Z  we  may  write,  analogous  to  (1) 
and  (2), 


[X]  =/[X,Z 

in*m. 

(4) 

in  =  /  [y,x 

I  ZWZ], 

(5) 

[z]  =/[z,y  | 

x]*m. 

(6) 

Substitution  of  (6)  into  (5)  and  then  (5)  into  (4)  produces  a  fixed  point  equation  analogous  to  (3).  A 
new  h  function  arises  with  associated  integral  operator  /*,  whence  TWl,  TW2  and  7W3  will  continue  to 
hold  in  this  extended  setting.  Extension  to  k  variables  is  straightforward.  A  noteworthy  by-product,  using 
7W1,  is  a  simple  proof  that  under  weak  conditions  specification  of  the  conditional  distributions 
[Ur  rm,  |  U,],  s  =  1,2 ,...,k  uniquely  determines  the  joint  density. 

2.2  Substitution  sampling 

Returning  to  (1)  and  (2),  suppose  that  [X  |  K]  and  [ y JX ]  are  available  in  the  sense  defined  at  the 
beginning  of  Section  1.  For  an  arbitrary  (possibly  degenerate)  initial  density  [X]0  draw  a  single  X{0)  from 
[X]0.  Given  X<0),  since  [T|X]  is  available  draw  F(1)  -  (K|X<0>],  whence  from  (2)  the  marginal  distribution 
of  y(1)  is  [X] ,  =  j  [y|X]  *  [X]0.  Now  complete  a  cycie  by  drawing  X(1)  -  [X | y(1>] -  Using  (1),  we  then 
have 

xw~  m,  =  /[xin*m,  =  /«*.*>  u']0  =  /*[x]0. 

Repetition  of  this  cycle  will  produce  Y<2)  and  X(2)  and  eventually,  after  i  iterations,  the  pair  (X(i),  Y(,))  such 
that 

X(0-^->X  ~  (*J.  Y  -  [Y], 

by  virtue  of  TW2.  Repetition  of  this  sequence  m  limes  each  to  the  ith  iteration  will  generate  m  i.i.d.  pairs 
(Xj-l\  Yj‘^),  j  =  l,...,m.  We  call  this  generation  scheme  substitution  sampling.  Note  that  though  we  have 
independence  across  j  we  have  dependence  within  a  given  j.  Some  practical  experience  with  regard  to  the 
autocorrelation  and  cross  correlation  in  very  special  cases  of  the  sequence  ((X;<r>,  Y^},  i=  1,2,...  is 
described  in  Clayton  (1988). 

If  we  terminate  all  repetitions  at  the  ith  iteration,  the  proposed  density  estimate  of  [X]  (with  an  analo¬ 
gous  expression  for  [ y ))  is  the  Monte  Carlo  integration 

[X],  =  -  I  [x|y/°i.  (.7) 

m  J=i 

Note  that  the  X,(0  are  not  used  in  (7)  (see  Section  2.6)  and  in  fact  need  not  be  generated  unless  [K]  is  to  be 
estimated  as  well. 

We  note  that  this  version  of  the  substitution  sampling  algorithm  differs  slightly  from  the  Imputation- 
Posterior  (IP)  algorithm  in  Tanner  and^Wong  (1987).  They  propose,  at  each  iteration  /,  l  =  1,2 . i,  crea¬ 

tion  of  the  mixture  density  estimate,  [X]/,  of  the  form  in  (7),  with  subsequent  sampling  from  |X],  to  begin 
the  next  iteration.  This  mechanism  introduces  the  additional  randomness  of  equally  likely  selection  from 
the  K;(h  before  obtaining  an  X<0.  We  suspect  this  simple  random  sampling  of  the  Y{,)  was  introduced  to 
allow  m  to  vary  across  iterations  but  it  seems  unnecessary.  Systematic  sampling  of  the  Xy<r>  (as  we  propose 
for  m  constant)  is  simpler  and,  as  the  distribution  theory  above  shows,  gives  a  convergent  procedure. 
Empirical  investigation  reveals  little  difference  between  the  two  modes  of  sampling  the  Y;{1)  with  regard  to 
the  goodness  of  the  resultant  estimated  marginal  density  at  the  ith  iteration.  The  unnecessary  resampling 
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implicit  in  Tanner  and  Wong  has  also  been  noted  by  Clayton  (1988),  and  systematic  selection  of  the  Y} 0 
was  also  proposed  by  Morris  (1987b)  in  his  discussion  of  the  Tanner  and  Wong  paper. 

The  Lx  convergence  of  [X],  to  [X]  is  most  easily  studied  by  writing 

/  |[X],-(X]|  «  /  |[X],-[X],|  +  /  |[X],-[X]|. 

The  second  term  on  the  right-hand  side  can  be  made  arbitrarily  small  as  i  — >  »  as  a  consequence  of 

*  P 

TW2  above.  The  first  term  in  the  r.h.s.  can  be  made  arbitrarily  small  as  m  — >  °°  since  [X], - >  [X],  for 

almost  all  X  (Glick,  1974). 

Extension  of  the  substitution  sampling  algorithm  to  more  than  two  random  variables  is  straightfor¬ 
ward.  We  illustrate  using  the  three  variable  case  assuming  the  three  conditional  distributions  in  (4-6)  are 
available.  Taking  an  arbitrary  starting  marginal  density  for  X,  say  [X]0,  we  draw  X(0>  -  [X]0  then 
(Z(0)\ y<0)')  _  [Z,y|X<0)],  then  (Y°\X(oy)  -  [r,X|Z(0>‘]  and  finally  (X(1>,Z(1>)  -  [X,Z|y(1)],  A  full 
cycle  of  the  algorithm  (i.e.  to  generate  X(1)  starting  from  X(0)  thus  requires  six  generated  variates  rather 
than  the  two  we  saw  earlier.  Repeating  such  a  cycle  i  times  will  produce  (X(,\  y<‘>,Z(0).  The  above  theory 

ensures  that  X(i) - >  X  -  [X],  YM-^-+  Y  ~  [T]  and  Zw  -  — >  Z  -  [Z].  If  we  repeat  the  entire  process  m 

times  we  obtain  i.i.d.  (Xj‘^,  Y^.Zf'*),  j  =  1...  ,,m  (independent  between,  but  not  within,  j's).  Note  that 
implementation  of  the  substitution  sampling  algorithm  does  not  require  specification  of  the  full  joint  distri¬ 
bution.  Rather,  what  is  needed  is  the  availability  of  [X,Z|y],  [y,X|Z]  and  [Z,y|Z],  Of  course,  in  many 
cases  sampling  from,  say,  [X,Z|y]  requires,  for  example,  [X|y,Z]  and  [y|Z],  i.e.  the  availability  of  a  full 
conditional  and  also  a  reduced  conditional  distribution.  Paralleling  (7),  the  density  estimator  of  [X] 
becomes 

-  I  [Xlyf.Zf]  (8) 

m  ja  1 

with  analogous  expressions  for  estimating  [y ]  and  [Z],  Lx -convergence  of  (8)  to  [X]  again  follows. 

For  k  variables,  Ux%...,Uk,  the  substitution  sampling  algorithm  will  require  k(k-  1)  random  variate 
generations  to  complete  a  cycle.  If  we  run  m  sequences  out  to  the  ith  iteration  (a  total  of  mik(k-  1)  random 

generations)  we  obtain  m  i.i.d.  i-tuples  (U\? . Ukp),  j  =  1,...,«  with  the  density  estimator  for  [C/,], 

s  =  l,...,Jfc  being 

ll/Ji  =  -  I  IU.  I  U,  =  UP,  t*  s).  (9) 

**  j=  l 

2.3  Gibbs  sampling 

Suppose  we  write  (4)— (6)  in  the  form 

[X]  =  J  [X|z,y]  •  [z|y]  *  [Y], 

[Y]  =  j  [Y\X,Z)*[X\Z]*[Z],  (10) 

[Z1  =  /  [Z|y,xi*[y|X]*[X]. 

Implementation  of  substitution  sampling  requires  availability  of  all  six  conditional  distributions  on  the 
r.h.s.  of  (10),  rarely  the  case  in  practice.  As  noted  at  the  beginning  of  Section  2,  the  full  conditional  distri¬ 
butions  alone,  [X|y,Z],  [y|Z,X],  [Z|X,  Y],  will  uniquely  determine  the  joint  distribution,  and  hence  the 
marginal  distributions,  in  the  situations  under  study.  An  algorithm  for  extracting  the  marginal  distributions 
from  these  full  conditional  distributions  was  formally  introduced  in  Geman  and  Geman  (1984)  and  is  known 
as  the  Gibbs  sampler. 

The  Gibbs  sampler  was  developed  and  has  been  mainly  applied  in  the  context  of  complex  stochastic 
models  involving  very  large  numbers  of  variables,  such  as  image  reconstruction,  neural  networks  and  expert 
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systems.  In  these  cases,  direct  specification  of  a  joint  distribution  is  typically  not  feasible.  Instead,  the  set 
of  full  conditionals  is  specified,  usually  by  assuming  that  an  individual  full  conditional  distribution  only 
depends  upon  some  'neighbourhood'  subset  of  the  variables  (a  reduced  form,  in  the  terminology  of  (i)  in 
Section  1).  More  precisely,  for  the  set  of  variables  Ux,U2,...,Uk 

[Ui\Uj,j  *  i]  s  [UiWjJsSi],  i  =  1 . k,  (11) 

where  5,  is  a  ‘small  neighbourhood'  subset  of  (1,2 . k}.  A  crucial  question  to  ask  is  under  what  cir¬ 

cumstances  the  specification  (11)  uniquely  determines  the  joint  distribution.  The  answer  is  taken  up  in 
great  detail  in  Geman  and  Geman  (1984),  involving  concepts  such  as  graphs,  neighbourhood  systems, 
cliques,  Markov  Random  Fields  and  Gibbs  distributions.  We  refer  the  reader  to  that  reference  for  details. 
In  all  the  examples  we  shall  consider,  the  joint  distribution  will  be  uniquely  defined.  Our  k's  will  be  small 
to  moderate  and  the  available  set  of  full  conditional  distributions  will,  in  fact,  be  calculated  from 
specification  of  the  joint  density. 

Gibbs  sampling  is  a  Markovian  updating  scheme  which  proceeds  as  follows.  Given  an  arbitrary  start¬ 
ing  set  of  values  we  draw  C/{1>  -  [£7 1 ( C/^0> . f/t<0)]  then  U2l)~ 

[U2\Ul'\U$0) . f/*(0)],  £/3(1)  -  (f/3|f/I<,)(/2(1),f/i0> . f/*<0)].  -  and  so  on,  up  to  £/*a)  - 

[f/t|f/i(1),...,f/*(i>i].  Thus  each  variable  is  ‘visited’  in  the  ‘natural’  order  and  a  cycle  in  this  scheme 
requires  K  random  variate  generations.  After  i  such  iterations  we  would  arrive  at  (t/,(‘\...,  UP).  Geman 
and  Geman  show,  under  mild  conditions,  that  the  following  results  hold. 

GG1  (convergence)  Up  U ,  -  [I/,]  as  i  —*  »». 

In  fact,  a  slightly  stronger  result  is  proven.  Rather  than  requiring  that  each  variable  be  visited  in 
repetitions  of  the  natural  order,  convergence  still  follows  under  any  visiting  scheme  provided  that  each 
variable  is  visited  infinitely  often  (i.o.). 

GG2  (rate)  Using  the  sup  norm,  rather  than  the  L,  norm,  the  joint  density  of  (Up . UP)  converges  to 

the  true  joint  density  at  a  geometric  rate  in  i,  under  visiting  in  the  natural  order.  A  minor  adjustment  to  the 
rate  is  required  for  an  arbitrary  i.o.  visiting  scheme. 

GG3  (ergodic  theorem)  For  any  measurable  function  T  of  Ul,...,Uk  whose  expectation  exists, 

Umi  £  T(Up . Up)  E(T(UX . Uk)). 

*-*“  ‘  i*i 


As  in  the  previous  section,  Gibbs  sampling  through  m  replications  of  the  above  i  iterations  (a  total  of 

mik  random  variate  generations)  produces  m  i.i.d.  1: -tuples  (U\f . Up),  j  =  1 . m,  with  the  proposed 

density  estimate  for  [Us]  having  exactly  the  form  (9). 


It  seems  sensible  that  we  might  utilize  the  up,  l  <  i  to  improve  upon  (9).  More  precisely,  defining 

Wsij  -  [U,\U,  =  Up,  t  *  r],  (9)  is  £  WsiJ/m.  Suppose  we  replace  WSIJ  by  Vu/  =  £  Wsijh  and  consider 

i* i  i=i 


[u.\:  =  -  £  vs 


c) 


By  GG3,  as  i  — >  »,  VJiy  — — »  [U,  ],  a  constant;  the  convergence  of  (*)  depends  only  on  i  not  on  m.  By 
contrast,  as  i  — >  «>,  Wtij  converges  in  distribution  to  a  random  variable.  Upon  averaging  such  random  vari¬ 
ables  over  m  we  obtain  convergence  to  [U,].  In  terms  of  variability  var[(/J/  <  varfl/,],  since 
vai(V,it)  <  va TiW,it)  whence  (*)  is  better  than  (9).  In  practice  (9)  has  performed  so  well  that  we  haven't 
needed  to  resort  to  (*).  Also  in  practice  we  might  delete  from  VUJ  several  of  the  early,  "far  from  con¬ 
verged”,  W!tj . 
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2.4  Relationship  Between  Gibbs  Sampling  and  Substitution  Sampling 

It  is  apparent,  as  also  noted  by  Clayton  (1988),  that  in  the  case  of  two  random  variables  Gibbs  sam¬ 
pling  and  substitution  sampling  are  identical.  For  more  than  two  variables,  using  (10)  and  its  obvious  gen¬ 
eralization  to  k  variables,  we  see  that  Gibbs  sampling  assumes  the  availability  of  the  set  of  k  full  condi¬ 
tional  distributions  (the  minimal  set  needed  to  uniquely  determine  the  joint  density).  The  substitution  sam¬ 
pling  algorithm  requires  the  availability  of  a  total  of  k(k-  1)  conditional  distributions  including  all  the  full 
conditionals. 

Gibbs  sampling  is  known  to  converge  slowly  in  applications  with  k  very  large.  Regardless,  fair  com¬ 
parison  with  substitution  sampling,  in  the  sense  of  total  amount  of  random  variate  generation,  requires  that 
we  allow  the  Gibbs  sampling  algorithm  i(k  -  1)  iterations  if  the  substitution  sampling  algorithm  is  allowed  i. 
Even  so,  there  is  clearly  scope  for  accelerated  convergence  from  the  substitution  sampling  algorithm  since 
it  samples  from  the  ‘correct’  distribution  each  time,  while  Gibbs  sampling  only  samples  from  the  full  condi¬ 
tional  distributions. 

To  amplify  upon  this  point,  we  describe  how  the  substitution  sampling  algorithm  might  be  carried  out 
under  availability  of  just  the  set  of  full  conditional  distributions.  We  shall  see  that  it  can  be  viewed  as  the 
Gibbs  sampler,  but  under  an  i.o.  visiting  scheme  different  from  the  natural  one.  We  present  the  argument  m 
the  three  variable  case  for  simplicity.  Returning  to  (10),  if  [T|X]  is  unavailable  we  can  create  a  sub¬ 
substitution  loop  to  obtain  it  by  means  of: 

(W  =  /[y|x,z]*[Z|*]. 

r  (12) 

[Z\X]=  j  [Z\X,Y]»[Y\X]. 

Similar  subloops  are  clearly  available  to  create  [X\Z]  and  [Z\Y],  In  fact,  for  k  variables  this  idea 
can  be  straightforwardly  extended  to  the  estimation  of  an  arbitrary  reduced  conditional  distribution  given 
the  full  conditionals.  We  omit  the  details. 

The  above  analysis  suggests  that  we  could  in  fact  view  the  reduced  conditional  densities  such  as 
[T|X]  as  ‘available’,  and  that  we  could  thus  carry  out  the  substitution  algorithm  as  if  all  needed  conditional 
distributions  were  available.  In  fact,  [T|X],  for  example,  is  not  ‘available’  in  our  earlier  sense.  Under  the 
subloop  in  (12),  we  can  always  obtain  a  density  estimate  for  [T|Af]  given  any  specified  X,  say  ^(0).  How¬ 
ever,  at  the  next  cycle  of  the  iteration  we  would  need  a  brand  new  density  estimate  for  [T|X]  at  X  =  Xu>. 
Nonetheless,  suppose  we  persevered  in  this  manner,  making  our  way  through  one  cycle  of  (10).  The  reader 
may  verify  that  the  only  distributions  actually  sampled  from  are,  of  course,  the  available  full  conditionals, 
that  at  the  end  of  the  cycle  each  full  conditional  will  have  been  sampled  from  at  least  once,  and  thus  that 
under  repeated  iterations  each  variable  will  be  visited  i.o..  Therefore,  this  version  of  the  substitution  sam¬ 
pling  algorithm  is  in  fact  merely  Gibbs  sampling  with  a  different  but  still  i.o.  visiting  order.  As  a  result, 
GG1,  GG2  and  GG 3  still  hold  (7W1,  TW2,  7W3  apply  directly  only  when  all  required  conditional  distribu¬ 
tions  are  available).  Moreover,  there  is  no  gain  in  implementing  the  Gibbs  sampler  in  this  complicated 
order;  the  natural  order  is  simpler  and  equally  good. 

This  discussion  may  be  readily  extended  to  the  case  of  k  variables.  As  a  result,  we  conclude  that 
when  only  the  set  of  k  full  conditionals  is  available  the  substitution  sampling  algorithm  and  the  Gibbs 
sampler  are  equivalent. 

Furthermore,  we  can  now  see  when  substitution  sampling  offers  the  possibility  of  acceleration  relative 
to  Gibbs  sampling.  This  will  occur  when  some  reduced  conditional  distributions,  distinct  from  the  full 
conditional  distributions,  are  available.  Suppose  we  write  the  substitution  algorithm  with  appropriate  condi¬ 
tioning  to  capture  these  available  reduced  conditionals.  As  we  traverse  a  cycle,  we  would  sample  from 
these  distributions  as  we  come  to  them,  otherwise  sampling  from  the  full  conditional  distributions. 

An  example  will  help  to  clarify  this  idea.  One  way  to  carry  out  the  Gibbs  sampler  in  (10)  is  to  follow 
the  ‘substitution’  order  rather  than  the  natural  order.  That  is,  given  an  initial  X<0),  y<0),Z<0>  we  start,  for 
example,  at  t1-!  bottom  line  of  (10),  drawing 
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(i)  Y(oy  from  fX|X(0,,Z<0>], 

(ii)  Z<0)'  from  [Z  |  K<0>  .  X<0)] , 

(iii)  X(0)'  from  [X|Z<0)',X(0r], 

(iv)  X(,)  from  [X|X(0)',Z<or], 

(v)  Z(1)  from  [Z|X<l),X(0>'], 

(vi)  X(l)  from  [Z\Yw,Z(l)]. 

Thus,  in  this  case,  one  cycle  using  the  substitution  order  corresponds  to  two  cycles  using  the  natural 
order.  Suppose,  however,  that,  in  addition  to  the  full  conditional  distributions,  [Z|X],  say,  is  available  and 
is  distinct  from  (Zj X,Y).  Following  the  substitution  order,  at  step  (v)  we  would  instead  draw  Z(l)  from  the 
‘correct’  distribution,  (Z|X(1>]. 

In  Section  3,  we  provide  classes  of  examples  where  distinct  reduced  conditional  distributions  will  be 
available  and  classes  where  they  generally  will  not.  In  Section  4,  we  present  some  preliminary  computa¬ 
tions  which  attempt  to  quantify  the  acceleration  in  convergence  which  arises  from  having  available  distribu¬ 
tions  additional  to  the  full  conditionals. 

2J  The  Rubio  Importance  Sampling  Algorithm 

Rubin's  comments  (1987)  to  Tanner  and  Wong  include  the  suggestion  of  a  non-iterative  Monte  Carlo 
method  for  generating  marginal  distributions  utilizing  importance  sampling  ideas.  We  present  the  basic 
idea  first  in  the  two  variable  case.  Suppose  we  seek  the  marginal  distribution  of  X,  given  only  the  func¬ 
tional  form  (modulo  the  normalizing  constant)  of  the  joint  density  [X,  Y]  and  the  availability  of  the  condi¬ 
tional  distribution  [X|X]  (a  special  case  of  the  conditions  described  in  (ii)  of  Section  1). 

Suppose  further,  as  is  typically  the  case  in  applications,  that  the  marginal  distribution  of  X  is  not 
known.  Choose  an  importance  sampling  distribution  for  Y  which  has  positive  support  wherever  [X]  does 
and  which  has  density  [X],,  say.  Then  [X|X]*[X],  provides  an  importance  sampling  distribution  for 
(X,Y).  Suppose  we  draw  i.i.d.  pairs  (X,,Y/),  l  =  1 N  fro"1  this  joint  distribution;  for  example,  by  draw¬ 
ing  Y,  from  [X],  and  then  X,  from  {X|X,].  Rubin’s  idea  is  to  calculate  r,  =  [X,,Y,]/[X,  |X,]*[X,],, 
/  =  l,...,Af  and  then  estimate  the  marginal  density  for  [X)  by 

I  [*|X,]r, 

[X]  =  ^ - .  (13) 

I  r, 

/-i 

Note  the  important  fact  that  [X, X]  need  only  be  specified  up  to  a  constant  since  the  latter  will  cancel  in 
(13).  In  other  words,  we  do  not  need  to  evaluate  the  normalizing  constant  for  (X,  X].  This  feature  is 
exploited  in  the  examples  of  Section  3. 

By  dividing  the  top  and  bottom  of  (13)  by  N  and  using  the  Law  of  Large  Numbers,  we  immediately 

have: 

R]  (convergence)  [X]  — >  [XJ  w.p.  1  as  N  — *  °°  for  almost  every  X. 

If,  additionally,  [X|X]  is  available  we  immediately  have  an  estimate  for  the  marginal  distribution  of 
X: 


The  successful  performance  of  (13)  will  typically  depend  strongly  upon  the  choice  of  [F],  and  its 
closeness  to  (F].  Thus  the  suggestion  of  Tanner  and  Wong  in  their  rejoinder  to  Rubin’s  discussion  of  their 
paper,  to  perhaps  use  for  [F],  the  density  estimate  created  after  i  iterations  of  the  substitution  algorithm 
merits  further  investigation.  In  fact,  the  whole  problem  of  general  strategies  for  synthesizing  both  the  itera¬ 
tive  and  non-iterative  approaches  under  a  fixed  budget  (total  number  c.  random  generations)  criterion  needs 
considerable  further  study. 

The  extension  of  the  Rubin  importance  sampling  idea  to  the  case  of  k  variables  is  clear.  For  instance, 
when  k  =  3  suppose  we  seek  the  marginal  distribution  of  X  gi  ,cn  the  functional  form  of  [X,  Z  J  up  to  a 
constant  and  the  availability  of  the  full  condi?  anal  [X\Y.Z].  In  this  case,  the  pair  (F,Z)  plays  the  role  of  Y 
in  the  two  variable  case  discussed  above  and,  in  general,  we  need  to  specify  an  importance  sampling  distri¬ 
bution  [F.Z],.  However,  if,  for  example,  [Y\Z]  is  available  we  will  only  need  to  specify  [Z],.  In  any 
case,  we  draw  i.i.d.  triples  ( Xt,Y,,Z l  =  1  and  calculate 

fX/.F/.Z,] 

r'  [x,\Y„z,\*[Y„zt]/ 

The  marginal  density  estimate  for  [X]  then  becomes,  analogous  to  (13), 

I  [X\ Y„Zi)r, 

[X]  =  — - .  (14) 

Z  n 

/» i 

We  note  that  in  the  *- variable  case  the  Rubin  importance  sampling  algorithm  requires  a  total  of  Nk 
random  variate  generations,  while  Gibbs  samping  stopped  at  iteration  i  will  require  mik  generations.  For 
fair  comparison  of  the  two  algorithms,  we  should  therefore  set  S  -  mi.  The  relationship  between  the  esti¬ 
mators  (7)  and  (13)  may  be  clarified  if  we  resample  IT-5? . from  ihe  distribution  which  places  mass 

r,/Zr,  at  Y,,  l  =  1 . N.  We  could  then  replace  (13)  by 

[X]  =  -  l  [X | >7 ]  (15) 

so  that  (7)  and  (15)  are  of  the  same  form.  Relative  performance  on  average  depends  upon  whether  the 
distribution  of  F(0  or  of  F*  is  closer  to  [F],  Empirical  work  described  in  Section  4  suggests  that  under  fad 
comparison  (7)  performs  better  than  (14)  or  (15).  It  seems  preferable  to  iterate  through  a  learning  process 
with  small  samples  rather  than  to  draw  a  one-off  large  sample  at  the  beginning  (an  idea  which  underlies 
much  modem  work  in  adaptive  Monte  Carlo:  see,  for  example.  Smith  el  al,  1987). 

2.6  Density  Estimation 

In  this  section,  we  consider  the  problem  of  calculating  a  final  form  of  marginal  density  from  the  final 
sample  produced  by  either  the  substitution  or  Gibbs  sampling  algorithms.  Since,  for  any  estimated  margi¬ 
nal,  the  corresponding  full  conditional  has  been  assumed  available,  efficient  inference  about  the  marginal 
should  clearly  be  based  on  utilizing  this  full  conditic.ial  distribution.  In  the  simplest  case  of  two  variables. 

this  implies  that  (X|F)  and  the  Y}‘',j  -  1 . m  should  be  used  to  make  inferences  about  [ X ] ,  rather  than 

imputing  X/° ,  j  =  1 . m,  and  basing  inference  upon  these  X;(,)’s.  The  formal  argument  is  essentially  tnat 

of  Rao-Blackwellization,  for  which  we  shall  sketch  a  proof  in  the  context  of  the  density  estimator  itself.  If 
X  is  a  continuous  p-dimensional  random  variable,  consider  any  kernel  density  estimator  of  [XI  based  upon 
the  X/°  (see,  for  example,  Devroye  and  Gyflrfi,  1985)  evaluated  at  say  X0: 
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1 


where  K  is  a  bounded  density  on  Rp  and  the  sequence  (/:m)  is  such  that  as  m  ->  =«,  hm 
mhp  — >  oo.  To  simplify  notation,  set 


= 


0  while 


so  that  4’ =  -  I  Qm.x.&j*')-  Define 

m  jm  1 

=  -  £  EiQ^iX) |y«). 

m  j= i 

3y  our  earlier  theory,  both  4]  an£i  7x?  have  ^e  same  expectation.  By  the  Rao-Blackwell  theorem. 


vvE(QmJ('(X\Y))  s  var QmJ[^X), 


whence 

MSE(Yx ?)  <  AfS£(4J). 

where  MS£  denotes  mean  square  enor  of  the  estimate  of  [X0] . 


Now  for  fixed  Y,  as  m  ->  £(C«jr.(X|y))  ->  [XolH  for  almost  every  X0  by  the  Lebesgue  Density 

Theorem  (see  Devroye  and  Gyflrfi,  p.3).  Thus  in  terms  of  random  variables  we  have 
E{QmJl,{X\Y))-U[XQ\Y].  so  that,  for  large  m,  yjjP  -  [f0],,  MSE(y^)  =  A/5£([X0],.).  whence  [f0),  >s 
preferred  to  4^ . 


1  m 

The  argument  is  simpler  for  estimation  of,  say,  tj  =  E(T(X))  =  [  T’X)  •  [X].  Here  /),  =  —£  r(X,(0) 

;  ;« 1 


is  immediately  seen  to  be  dominated  by  tj2  =  ~  Z  £(7(X)|  K^)- 


3.  Examples 

A  major  area  of  potential  application  of  the  methodology  we  have  been  discussing  is  in  the  calcula¬ 
tion  of  marginal  posterior  densities  within  a  Bayesian  inference  framework.  In  recent  years,  there  have 
been  a  number  of  advances  in  numerical  and  analytic  approximation  techniques  for  such  calculations — see, 
for  example,  Naylor  and  Smith  (1982,  1988),  Smith  et  al  (1985,  1987),  lierney  and  Kadane  (1986),  Shaw 
(1988),  Geweke  (1988) — but  implementation  of  these  approaches  typically  requires  sophisticated  numerical 
analytic  expertise  and  possibly  specialist  software.  In  stark  contrast,  the  three  sampling  approaches  we 
have  discussed  are  essentially  trivial  to  implement  and,  for  many  practitioners,  this  feature  will  more  than 
compensate  for  any  relative  computational  inefficiency.  To  provide  a  flavour  of  the  kinds  of  area  of  appli¬ 
cation  for  which  the  methodology  is  suited,  we  present  six  examples  of  typical  probability  structures  that 
arise. 

3.1  A  Class  of  Multinomial  Models 

We  extend  the  one  parameter  genetic  linkage  example  described  in  Tanner  and  Wong  (1987,  p.5301. 
which,  in  its  most  general  form,  involves  multinomial  sampling  where  some  observations  are  not  assigned 
to  individual  cells  but  to  aggregates  of  cells  (see  Hartley,  1958;  Dempster,  Laird  and  Rubin,  1977).  We 
give  the  model  and  distribution  theory  in  detail  for  a  two  parameter  version,  from  which  the  extension  to  k 
parameters  should  be  clear.  Let  the  vector  Y  =  (7, . Ys)  have  a  multinomial  distribution 

a^O  +  bua2d+  b2.a^r]  +  h3, a4T7  +  bt , c(  1  -  P-  77 )), 
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where  a, ,  bx  ^  0  are  known  and  0  <  c  =  1-  X  i>,-  =  a,  +  a2  =  ^3  +  ^4  <  1-  Thus  8,t)  range  over  9  ?  0, 

>-i 

77  ?  0  and  0+tj  «  1,  so  that  a  three  parameter  Dirichlet  distribution,  Dirichlet(a,,a2,ar3)  may  be  a  natural 
choice  of  prior  density  for  (#,77).  From  the  form  of  [y  1 0, 77]  •  [0,77]  the  reader  will  note  that  obtaining  the 
exact  marginals  [0\Y],[n\Y]  will  be  somewhat  messy  (involving  a  two-dimensional  numerical  integral). 
However,  all  three  sampling  approaches  we  have  described  are  readily  applicable  here  by  considering  the 
unobservable  nine  cell  multinomial  model  for  X  -  ( X,,X2,...,X9 ),  given  by 

Mult(n,  ax0,bx,a29,b2,a3T],b3,aiT],bi,c{\-9-T])). 

From  the  form  of  we  see  that 

[0,771*1  -  DirichIet(Xl  +  X3  +  al,X5  +  X7  +  a2,X9  +  a3), 

whence  [0|X],[77|XJ  are  available  as  Beta  distributions  for  sampling.  Furthermore,  [0|X,  77]  and  [77  |X,  9] 
are  available  as  scaled  Beta  distributions,  scaled  respectively  to  the  intervals  [0, 1  -77]  and  [0, 1-0].  If  we 
let  Yx  =  X,+X2,  Y2  =  X3  +  X4,  Y3  =  Xs  +  X6,  Ya  =  X7  +  Xg,  Ys  =  X9  and  define  Z  =  (X,,X3,X5,X7),  we  see 
that  specification  of  X  is  equivalent  to  specification  of  ( Y,Z ).  Also,  [Z|y, 0,77]  is  the  product  of  four 
independent  Binomials  for  X1,X3,X5,X7,  given  by 

which  are  therefore  readily  available  for  sampling. 

In  the  context  of  Section  2,  we  have  a  three  variable  case,  (0,77 ,Z),  with  interest  in  the  marginal 
distributions  [0|y],  [77 1 K],  {Z|y).  Gibbs  sampling  requires  [0|y,Z,77),  [77|y,Z,0]  and  [Z|y,0,77],  all  of 
which  are  available.  But  in  this  case  the  reduced  distributions  [0|y,Z]  and  [77 1 y, Z]  are  also  available  and 
this  allows  us  to  accelerate  the  substitution  sampling  algorithm.  These  reduced  distributions  also  substan¬ 
tially  simplify  the  Rubin  importance  sampling  algorithm  in  obtaining  [0|y]  and  [77 1 y] ;  only  an  importance 
sampling  distribution  [Z|y],  need  be  specified  (for  example,  a  ‘default’  choice  might  be  binomials  with 
chance  equal  to  one  half).  Detailed  comparison  of  the  performance  of  the  three  algorithm  for  a  specific 
case  of  this  multinomial  class  will  be  given  in  Section  4. 

3.2  Hierarchical  Models  Under  Conjugacy 

Consider  a  general  Bayesian  hierarchical  model  having  k  stages.  In  an  obvious  notation,  we  write  the 
joint  distribution  of  the  data  and  parameters  as 

[y|«ii  *  [Qx\e2\ *  [02\03]...  *  [0*_i|0t]  *  [0*].  (i6) 

where  we  assume  all  components  of  prior  specification  to  be  available  for  sampling.  Primary  interest  is 
usually  in  the  marginal  posterior  [0t  |  y] . 

We  note  in  passing  that  the  collection  of  variables  Y ,  0, . 0*  form  a  random  Markov  field  with  an 

‘adjacent’  neighbourhood  system  and  that  the  joint  distribution  of  the  variables  is  a  Gibbs  distribution.  See 
Geman  and  Geman  (1984)  for  definitions  ard  other  examples. 

The  hierarchical  structure  implies  that 

[0,|r,02]  <  =  1. 

[9i\Y,0j,j*i]=  [0,|0,-l,0(  +  1]  1  <  «  <  *-l.  (17) 

[0t|0*_,]  i  =  k. 

Suppose  we  assume  proper  conjugate  distributions  at  each  stage.  This  is  common  practice  in  the  for¬ 
mulation  of  such  models  except  perhaps  for  [0*  J  which  is  often  assumed  vague.  However,  conjugate  priors 
can  generally  be  made  arbitrarily  diffuse  by  appropriate  choices  of  hyperparameters  and  so  this  case  is  also 
implicitly  subsumed  within  the  conjugate  framework.  [0*]  can.  in  fact,  be  vague  provided  [0*104-!]  is 
still  proper  and  available.  (See  examples  3.4,  3.5.)  Conjugacy  implies  that  the  densities  in  (17)  will  be 
‘available’  as  ‘updated’  versions  of  the  respective  priors  (see  e.g.  Morris,  1983a).  Typically,  no  distinct 
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reduced  conditional  distributions  will  be  available  and  Gibbs  sampling  would  be  used  to  estimate  the 
desired  marginal  posterior  densities.  To  clarify  this  latter  point,  consider  the  case  k  =  3.  The  six  condi¬ 
tional  distributions  in  (10)  would  be  (0, |y, , ©3 ] .  [02|y,0i,03],  [03|y,0,,02],  [03|y,02],  [0,|y,e3]  and 
(02|y. 0,].  The  first  three  are  available  as  in  (17),  the  fourth  is  available  but  is  not  distinct  from  the  third 
and  the  last  two  are  usually  unavailable. 

As  a  concrete  illustration,  consider  an  exchangeable  Poisson  model,  which  will  be  further  illustrated 
in  Section  4  with  the  reanalysis  of  a  published  data  set.  Suppose  we  observe  independent  counts,  j,,  over 

differing  lengths  of  time,  r,  (with  resultant  rate  p,  =  s ••//,-),  i  =  1 . p.  Assume  [ s,  | A,  ]  =  P0(X,t,)  and  that 

the  A,  are  i.i.d.  from  G(a,/3),  with  density  Xf~le~x,^/(iar(a).  The  parameter  a  will  be  assumed  known 
(in  practice,  we  might  treat  a  as  a  ‘tuning’  parameter  or  perhaps,  in  an  empirical  Bayes  spirit,  estimate  it 
from  the  marginal  distribution  of  the  r,’s)  and  /?,  in  turn,  is  assumed  to  arise  from  an  Inverse  Gamma 
distribution  lG(y, 8)  with  density  5re~slP//3r*iIXy).  (A  diffuse  version  of  this  final  stage  distribution  is 
obtained  by  taking  S  and  y  to  be  very  small,  perhaps  zero.) 


Letting  Y  =  (r,,..., sp),  the  conditional  distributions  [Xj\Y]  are  sought.  The  full  posterior  of  A;  is 
given  by 

*  G^a+sj.^tj+lj  j,  j  =  1 . p,  (18) 

while  the  full  posterior  for  /?  is  given  by 


[p\Y,Xx,...,Xp]  =  rG(y+pa,lX,  +  S).  (19) 

No  distinct  reduced  conditional  distributions  are  available.  The  conditional  distribution  of  Xj  given  X  and  /) 
is  (18),  regardless  of  which  or  how  many  Xit  i  *  j  are  given.  The  conditional  distribution  of  /J  given  Y  and 

any  subset  of  the  A/s  is  unavailable.  Given  (A(,0),A40) . X(°\f}m)  the  Gibbs  sampler  draws  Aju  - 

G^a  +  Sj,  j,  j  -  l,....p,  and  then  J?(1)  -  /G^y+ap,  X  A/’  +  jj  to  complete  one  cycle.  If  we 

carry  out  m  repetitions  each  of  i  iterations,  generating  (A\'? . i*'),/J/‘>),  /  =  l,...,m,  the  marginal  density 

estimate  for  A;  is 


[Ay|n  =  -  X  G 

”  i- 1 


(20) 


while 


W)  =  -  X  IGiy+ap.ZXV  +  S). 

171  lm  1 


(21) 


Rubin’s  importance  sampling  algorithm  is  also  applicable  in  the  setting  (16),  taking  a  particularly 
simple  form  in  the  cases  k  =  2,3.  For  k  =  3,  suppose  we  seek  [0/yj.  The  joint  density  [0t,02,03|y]  = 
[y,0,,02,03]/|T],  where  the  functional  form  of  the  numerator  is  given  in  (16).  An  importance  sampling 
density  for  [0,,02>03|y]  could  be  sampled  as  [0,  |y,02]  *  [03 102]  *  [02|  n,  for  some  [02|y],.  As 
remarked  in  Section  2.5,  a  ‘good’  choice  for  [02jy],  might  possibly  be  obtained  through  a  few  iterations  of 

the  substitution  sampling  algorithm.  In  any  case,  for  l  =  1 . N  we  would  generate  02,  from  [02|K],,  03, 

from  [ 03 1 02/ ]  and  0U  from  [0||y,02;],  Calculating 

r  _  _ [y,  02;  ■  03;] _ 

'  '  (011^.0211  •  [03, |02,]*[02,|y]/ 


we  obtain  the  density  estimator 


[0i  in  = 


X[0,  |r,  02/] -r/ 
Ir, 


Note  that,  in  the  terminology  of  Rubin,  the  algorithm  in  this  case  can  be  ‘streamlined’  by  writing  the  joint 
density  in  the  numerator  of  r,  as  (0„|y,02/l  *  iy|02/]  *  [02/|03/]  •  [03(]  and  noting  that  r,  does  not  involve 
0U,  so  that  we  need  not  actually  generate  the  9U. 
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Returning  to  the  exchangeable  Poisson  model,  the  estimator  of  the  marginal  density  of  A,  under 
Rubin’s  importance  sampling  algorithm  is 

HI"- 

Here 


n*  triAWAJ/IAin,. 

where  [Y\pt]  is  the  product  of  Negative  Binomial  densities,  i.e. 


11  \sj\rxa){tl+lil)‘>'ay 


while  [Pi]  is  the  inverse  gamma  prior  evaluated  at  fit.  If  [P\Y],  is  not  obtained  from  the  substitution 
sampling  algorithm,  as  in  (21),  an  alternative  choice  is  /G^y+ap,  £  pt  +  1  j.  This  arises  since, 

=  E[x . K\r\iP \y.*i . A,]  «  . 

using  X,  =  pj  in  (19). 


3.3  Multivariate  Normal  Sampling 


A  commonly  occurring  problem  in  combining  continuous  multivariate  data  is  that  often  not  all  vari¬ 
ables  are  observed  for  each  experimental  unit;  see,  for  example,  Dempster,  Laird  and  Rubin  (1977).  If  the 
data  are  sampled  from  multivariate  normal  populations  with  conjugate  priors  for  the  mean  and  covariance 
matrix,  we  have  a  general  class  of  models  where  all  full  conditional  distributions  and  at  least  some  reduced 
conditional  distributions  will  be  available.  We  illustrate  in  the  simplest  case,  where  we  assume 

(f/1*)’  ‘  =  {v^’i  =  1 . n2'  (w^)’  k  =  1 . "3 

are  all  i.i.d.  N(9,A)  with  9  -  N{p,£),  where  9  =  is  not  observable,  but  p  =  d  and  £  are  assumed 
known.  Let 


with  similar  notation  for  V  and  W.  Finally,  let  X  =  (U,  V,  W),  2xN,  with  X  =  N~'X1,  where  1  is  a  column 
vector  oiN  l’s  and  N  =  nt  +  n2  +  rij.  Standard  calculations  show  that  [9 IX]  is  N(t},Q),  where 

tj  =  (NA-'  +  £-lr'(HA-'X  +  £-'p) 


and 


Q  =  {NA~l 

With  the  obvious  partitioning. 


the  marginals  [9x\X]  =  N(tjx,12u)  and  (02|X]  =  N{tj2,Q22)  are  available.  Suppose,  however,  that  V2,IV’1, 
say,  are  unobserved.  Let  Y  =  (U,VX,W2),  Z  =  (V^.W,)  so  that  X  s  ( Y,Z ).  As  in  Section  3.1,  we  have  a 
‘three  variable’  problem,  here  involving  9t,92,Z.  The  full  conditional  distributions  are  all  normal  and 
hence  available.  For  9t  and  92, 

[9X\Y,Z,92]  =  N(t] j  +f2,2f322,(02—  t]2),  f2|(  —flx2f222  f22l ) 

[92 !  Y,  Z,  ]  =  N(t)2  +  f22if2||'(0t  -  rji ),  f222-f22|f3t  i'f2|2). 

Letung  U\  =  af'f/il,  with  similar  notation  for  U2,Vl,V2,Wl,W2,  we  note,  by  sufficiency,  that  with  regard 


* 


-  14  - 


to  Z  we  only  need  the  full  posterior  [Z|  Y,0\,62],  where  ZT  =  (V2,  W,),  ?T  =  (Ux,Uz,V\.W2).  Since 

//0\  /nfld  0  0 

Xr  -  (f/„t/2,V„V2,iV1,W2)|01.02) -N  0  ,  0  n2-'A  0 

\\0/  \  0  0  n{'A 

the  conditional  distribution  [Z| f,0!>02]  is  clearly  normal.  With  the  full  conditionals  and  the  reduced  con¬ 
ditionals  [0i|y,Z],  [02|K,Z]  available,  the  accelerated  substitution  algorithm  can  be  used  to  obtain  [0,  IK], 

[02|n. 

The  Rubin  importance  sampling  algorithm  is  also  straightforward  in  this  case.  Simplifying  notation 
by  working  with  the  sufficient  statistic  (f,Z),  suppose,  for  instance,  we  seek  the  density  estimator  of 
[0,|K].  We  have 

(0,1  H  =  £[Sl\Y,Z,,62l]rl/  Lr, , 

where 

T  _  (^/|0|/.02/]  *  (0U.®2;] 

ieu\r.zlt0ai]»ie2,\Y.zl}»[zl\Y], 

with  X,  m_(Y,Zi)  and  [Z|y],  a  specified  importance  sampling  density.  Thus,  for  /  =  1 . N,  we  generate 

Z,  -  (Z|f],  then  02/  -  [02|y,Z/]  and  Qu  ~  [0, | Y.Zi.Qy].  Again,  the  choice  of  [Z | y could  be  made 
using  a  few  iterations  of  substitution  sampling^  or  perhaps  based  on  the  intuitively  appealing  ‘estimated’ 
conditional  form,  [Z|y,0lt02],  where  9X  =  (nxU  n2Vx)  l(nx  +  n2),  92  =  (.fiiU2  +  n2W3) /(_nx+ n2). 

3.4  Variance  component  models 

Bayesian  inference  for  variance  components  has  typically  required  subtle  numerical  analysis  or  intri¬ 
cate  analytic  approximation,  as  evidenced,  for  example,  in  Box  and  Tiao  (1973,  Chapters  5  and  6).  In 
marked  contrast  to  such  sophistication,  marginal  posterior  densities  for  variance  components  are  readily 
obtained  through  simple  Gibbs  sampling. 

We  illustrate  this  for  the  simplest  variance  components  model  defined  by 

Yii  =  9i+eii,  i-l . K,  y=l . J, 

where,  assuming  conditional  independence  throughout,  [0( \n,a£]  =  N(ji,<j£)  and  [ej;  |ct2 ]  =  N(0,c£),  so 
that  (Ki/|0j,<7,2]  =  N(9j,o?). 

Let  0  =  (0, . 9K),  Y  =  (Ku,...,l*/)  and  assume  that  n,a£,a£  are  independent,  with  priors 

specified  by  [jtx]  -  N(jh,Oo),  [ct|]  -  IG{ax,bx)  and  [cr2]  -  IG[a2,b2],  where  IG  denotes  the  inverse 
gamma  distribution  (as  in  Example  3.2)  and  ,ax,bx,a2,b2  are  assumed  known  (possibly  chosen  to 
correspond  to  diffuse  priors). 

The  joint  distribution  [K,  6,n,a£ ,ff2]  can  be  written  as 

[Y\9,o£]  •  [9\u,<r£]  •  [fi]  •  [<y£]  *  (c t,2]  (22) 

and  we  shall  follow  Box  and  Tiao  (1973,  Chapter  5)  in  focussing  interest  on  [cr||K]  and  [<j2|y]. 

From  the  Gibbs  sampling  perspective,  we  have  a  four  variable  system,  (0,/r,cr|,cr,2)  with  the  follow¬ 
ing  full  conditional  distributions: 
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[<r||y,/z,0,cr,2]  =  [o$\n.O]  =  IGiai  +  lK.bx  +  kZ&i-H?) 

[aj\Y,n,9,al]  =  [a}\ Y,0]  =  IG{a2  +  {KJ  ,b2  +  {Ll(Yir  Otf) 


w.e.d.oh  -  wai.n .  ) 

\  Og+Kao  Og  +KOq  ) 

[B\Y,n,ai,a ?]  =  n(  -ffa  jp+  T?‘ "  2/), 

\JOg  +  < t.  Jas+aj  Jai  +  a;  ) 


where  y  =  (?[.,...,?/),  y,.  =  -  X  1  is  a  Kxl  column  vector  of  l’s,  and  /  is  a  KxK  identity  matrix. 

J  j- 1 

Since  all  these  full  conditionals  are  available,  implementation  of  the  Gibbs  sampler  is  straightforward. 
Moreover,  extensions  to  more  elaborate  variance  component  models  follow  precisely  the  same  pattern,  since 
the  full  conditional  distributions  for  fi  and  9  will  continue  to  be  normal,  and  those  for  the  variance  com¬ 
ponents  will  continue  to  be  inverse  gamma. 

3.5  Normal  means  model 


The  exchangeable  k  -group  normal  means  model  with  different,  unknown  measurement  variances  in 
each  group  provides  a  simple  example  of  an  ‘unbalanced’  class  of  models  which  has  proved  difficult  to 
handle  using  empirical  Bayes  approaches  to  ‘estimating’  posterior  distributions  (see,  for  example,  Morris, 
1983b,  1987a).  Such  models  are  straightforwardly  handled  by  iterative  sampling  approaches,  as  we  saw 
with  the  Poisson  example  of  Section  3.2  and  will  further  illustrate  here  for  this  classical  normal  means 
example. 


Suppose  then,  assuming  conditional  independence  throughout,  that  Yy  -  N(0,,<7(2),  9. ■  -  N{n,x2), 

-  /G(ai,bi),  i  ~  1 . 1,j  =  1 . n  ~  N<jiQ,o%)  and  T2  -  lG{a2,b2),  where  p0,a$  ,ax,bx,a2,b2  are 

assumed  known  (possibly  chosen  to  reflect  diffuse  prior  information).  By  sufficiency,  we  can  confine  atten¬ 
tion  to  y=  {(?,., S?),  i  =  1.. where  YL  =  i ZY(J  and  S  2  =  i-Kfy- F;.)2.  Then,  if  we  write  9  = 

,'|  J  | 

(0| o  =  (<x2,...t  O/2),  the  joint  distribution  of  y \Q,c^ takes  the  form 

[Yle.o2]  *  [9\n,x2]  *  [tx2]  *  [n]  *  [r2],  (23) 

where 

[Yfao2]  •  [9\/x,t2]  *  [a2]  =  n  •  [S2la2]  *  [0,|/i,r2]  *  [cr,2]. 

•  ■I 

There  is.  of  course,  an  obvious  similarity  between  (22)  and  (23),  but  here  interest  is  taken  to  focus  on  the 

[0,|y].  ‘  -  1 . /-  From  the  Gibbs  sampling  perspective,  this  is  a  2/+2  variable  problem:  (9,, a2), 

i  =  1 ,...,/,  together  with  fi  and  r2 .  To  identify  the  forms  of  the  full  conditionals,  we  first  note  that 

[0|  Y.o2,^^]  =  N(e*,D*),  (24) 

where 


_  Jjy+na2 
*  - 

d*  -  aiT\ 


Ay  =  0,  i*j. 


Thus  the  full  conditional  distributions  [9i\Y,9jJ  *  i,cr2,/i,r],  i  =  1 . /,  are,  in  fact,  just  the  normal 

marginals  of  (24)  and  are  therefore  available  for  sampling.  From  (23),  we  easily  see  that 

[cr^y.fl./i.r2!  =  [cx2|y,0I  =  f[  [a2IY,.,S2,9,l. 


where 
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fcr.2 1 y, . *  2 .41  =  !0(a,  + */, . 6,  +  {!(¥,, -  9, )2). 
Finally,  and  closely  resembling  the  forms  obtained  in  Section  3.4, 


Mr.e.c*.*  . 

\  T  +  /<7q  T  +/CTo 


and 


=  [t2^]  ^  IG^  +  y.b^^-fi)2). 

3.6  An  errors-in-variable  model 


Again,  we  consider  a  simple  special  case  in  order  to  illustrate  the  scope  of  the  methodology.  Con¬ 
sider  Y  to  be  a  vector  of  responses  assumed  related  to  levels,  X,  of  a  covariate  according  to  the  straight-line 
model 

r'4i)o(£H- 

Responses  are  obtained  at  specified  levels,  X0,  of  the  covariate  but  suppose,  in  fact,  these  are  not  the  actual 
levels,  Xa.  Rather,  given  the  former,  beliefs  about  the  latter  are  represented  by  Xa  -  N(X0 ,  r2/).  Interest 
centres  on  9  =  .  Q2)  and  to  complete  the  distributional  specification  suppose  we  place  independent  conju¬ 

gate  priors  on  9,  o2  and  t2.  The  joint  distribution  on  (Y,Xa,B, cx^r2)  then  has  the  form 

[YlX^O.a2]  *  [Xa\r2]  *  [t2]  *  [0]  *  [a2],  (25) 

where  again  there  is  obvious  similarity  to  (22)  and  (23).  The  Gibbs  sampler  requires  [©ly.Xa.cr2.^]  = 
[0|y.Xa,<72],  [<r2|l',Xa,0,r2]  =  [tT^y.Xa.d],  lr2\Y,Xa,9.o2)  =  [T*\Xa]  and  [Xa \Y,  9,<j2,t2].  If  we 
assume  a  normal  prior  for  9,  together  with  inverse  gamma  priors  for  a2  and  T2,  we  obtain  normal  full 
conditionals  for  0  and  Xa,  and  inverse  gamma  full  conditionals  for  a2  and  r2.  We  omit  the  details,  which 
are  somewhat  similar  to  those  in  Section  3.5  and  3.4. 


4.  Numerical  Illustrations 
4.1  A  multinomial  model 

We  shall  provide  some  preliminary  insights  into  the  relative  performance  and  properties  of  the  substi¬ 
tution,  Gibbs  and  Rubin  importance  sampling  approaches  by  considering  an  artificial  problem  based  on  the 
class  of  multinomial  models  discussed  in  Section  3.1. 

We  suppose  that  data  Y  =  (Yl,Y2,Y3,Y4,Ys)  =  (14, 1,  l,  1,5)  are  available  as  a  sample  from  the  multi¬ 
nomial  distribution 


Mult(22,  i 9+  i,  j9 ,  irj,  jjj  +  J ,  j(l  -9- 77)), 

and  that  the  prior  for  ( 9,t 7)  is  taken  to  be  a  Dirichlet  (1, 1, 1)  distribution.  In  the  general  notation  of  Section 
3.1,  we  therefore  have  a,  =  j,  b,  =  fc,  a2  =  j,  b2  =  0,  a3  =  i,  b3  =  0,  a«  *  i,  b4  =  J,  a,  =  a2  =  a3  =  l, 
with  interest  centring  on  the  calculation  of  the  marginal  posterior  densities  [9\Y],  [77  |K]. 

By  considering  instead  a  'split-cell'  multinomial,  which  in  this  case  takes  the  form 

x  =  ( xux2 . Xy)  -  Mult(22,  J0,  j,  jfl,  Jrj,  J77,  *,  j(l  - 9~T1))< 

we  can  use  the  analysis  of  Section  3.1  for  this  special  case  of  a  seven  cell  multinomial  to  construct  substitu¬ 
tion  and  Gibbs  sampling  algorithms  involving  9,  1]  and  Z  =  (XUX5). 


The  Gibbs  sampler,  based  on  the  full  conditional  distributions,  iterates  around  the  cycle: 
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set  arbitrary  0(o\  t/(0); 

draw  Z(l)  from  [Z|0<o>,77(O),y],  which  is  given  by  the  product  of  two  independent  binomial  distributions, 
X{1)  -  Bi(y1.2a(O)(l+20(O))-,),^1)  -  Bi(y4,27j(0)(3  +  2r?(0))-1); 


draw  fl(l)  from  [0 1 77 (0),Z(l),y],  which  is  available  since  6/(l-7j(0))  has  a  Beta(X,(1)  +  y2+ 1,  y5+ 1)  distribu¬ 
tion; 

draw  from  [77 \0W,ZW,  Y],  where  77/(l-0(1>)  has  a  Betafyj+.X^'5-*-  l,y5  +  1)  distribution; 
reinitialize  the  cycle  with  0(1\t7(1>  and  iterate,  replicating  each  cycle  m  times. 

The  substitution  sampler  makes  use  of  the  fact  that  and  [t}\Y,Z,9]  can  be  replaced  in  this  case 

by  the  reduced  forms  [djy.Z],  leading  to  iteration  around  the  cycle: 

set  arbitrary  0(O), 

draw  Z(n  from  [Z|0(O),7j<O), Tl.  a  product  of  two  independent  binomial  distributions,  X,n)~ 
Bi(Yt,20(O>(l  +  20(O))~l),xj,)  ~  Bi(y4 , 2r/(0\3  +  2rj (0>)“ '); 

draw  Tj(t)  from  [77 |fl(0),Z<l), y],  where  tj/(1-0<o>)  has  a  Beta(y3  +  ^1)+  l,ys  +  1)  distribution; 

draw  0(,)  from  [0\t}w,Zw,Y],  where  0/(l-77{l))  has  a  Beta(X1<1)+  Y2+  l,y5+  1)  distribution; 

draw  Z(2)  from  [Z|0(1).77(l),  Y],  a  product  of  binomials  analogous  to  the  above; 

draw  77(2)  from  [77 |Z<2>, y],  where  tj  has  a  Beta(y3+A^2)  +  l.Jfj®  +  y2+  T5  +  2]  distribution; 

draw  0(2)  from  [0|77(2),Z<2),y],  where  0/(1~tj(2))  has  a  Beta (Xj(2>  +  y2  +  l,y5+  1)  distribution; 

reinitialize  the  cycle  with  0(2\  r](2)  and  iterate,  replicating  each  cycle  m  times. 

To  compare  the  two  forms  of  iterative  sampling,  we  first  obtained  very  accurate  numerical  estimates 
of  [0|y],  [77 1 y]  using  techniques  described  in  Smith  et  al  (1985,  1987)  and  from  these  then  obtained  the 
‘true’  5,  25,  50,  75  and  95  posterior  percentile  points  for  each  parameter.  Iterative  cycles  of  the  two  sam¬ 
ples  were  then  run,  calibrated  so  that  the  total  number  of  random  variates  generated  was  the  same  in  both 
cases,  as  described  in  Section  2.4.  The  initialization  was  defined  (for  an  arbitrary  generating  seed)  in  each 
case  by  taking  independent  samples  from  0  -  U{ 0, 1),  77  ~  U(0, 1),  subject  to  0  S  0+  tj  €  1.  At  each  cycle, 
m  =  10  drawings  of  the  parameters  were  then  made  and  estimates  of  the  cumulative  posterior  probabilities 
corresponding  to  each  of  the  five  true  percentile  points  for  each  parameter  were  obtained.  This  process  was 
replicated  5000  times,  enabling  us  to  study  the  mean  estimates  of  the  cumulative  probabilities,  together  with 
their  standard  errors,  as  well  as  the  percentage  of  occasions  on  which  each  sampler  was  closest  to  the  true 
value.  A  summary  of  the  results  following  each  of  the  first  four  cycles  is  given  in  Table  1. 
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Table  1 


Comparison  of  Substitution  (S)  and  Gibbs  (G)  samplers 


cdf  value 

estimate  (sd) 

S  closer  than  < 

9 

n 

9 

n 

G 

S 

G 

S 

.05 

.23l(.08) 

,217(.08) 

.033001) 

.044001) 

56% 

75% 

.25 

.504(.10) 

.492009) 

,I77(.04) 

,225(.04) 

55% 

78% 

cycle  1 

.50 

.713(.08) 

.706008) 

,380(.06) 

.459(.06) 

54% 

80% 

.75 

.873(.05) 

.871005) 

,620(.06) 

.706(.06) 

51% 

80% 

.95 

.978(.01) 

.978(.0I) 

,878(.04) 

.926003) 

49% 

80% 

.05 

.067(.04) 

.055003) 

,047(.01) 

.048001) 

56% 

51% 

.25 

.286(.07) 

.266(.07) 

.236004) 

,241(.04) 

56% 

52% 

cycle  2 

.50 

.535008) 

.522007) 

,478(.06) 

.487006) 

53% 

52% 

.75 

.773(.06) 

.768005) 

.728(.05) 

.737005) 

51% 

52% 

.95 

.956002) 

.956002) 

.940002) 

.944002) 

51% 

53% 

.05 

,052(.03) 

.049003) 

,049(.01) 

.049001) 

51% 

50% 

.25 

,254(.06) 

.252(. 06) 

,247(.04) 

.247004) 

51% 

50% 

cycle  3 

.50 

.505007) 

.508007) 

.496(. 06) 

.496006) 

51% 

49% 

.75 

.754(.06) 

.760(.05) 

.746005) 

.747005) 

51% 

50% 

.95 

.951002) 

.954(.02) 

.949002) 

.949002) 

51% 

50% 

.05 

.050003) 

.047(.03) 

.050(.01) 

.050(.01) 

51% 

51% 

.25 

.250006) 

,249(.06) 

.250(. 04) 

.2490 04) 

50% 

51% 

cycle  4 

.50 

.500(.07) 

.505007) 

,499(.06) 

.499006) 

51% 

51% 

.75 

.751(.06) 

.757005) 

,750(.05) 

.751  (.05) 

51% 

51% 

.95 

.950002) 

.953002) 

,950(. 02) 

.951002) 

51% 

49% 

We  note  from  Table  1  that,  initially  (cycles  1  and  2)  the  substitution  sampler  adapts  more  quickly 
than  the  Gibbs  sampler,  particularly  for  r\.  However,  by  the  time  we  reach  the  3rd  and  4th  cycles,  the  two 
approaches  are  performing  indistinguishably.  What  is  astonishing,  perhaps,  is  just  how  remarkably  good 
their  performance  is.  By  the  4th  cycle,  using  only  m  =  10  drawings  and  starting  from  a  default  non- 
informative  baseline,  the  marginal  posterior  density  estimators  based  on  (8)  are  providing  on  average 
extremely  accurate  estimates  of  cumulative  probabilities.  Our  experiences  with  this  and  other  examples 
(see  Section  4.2)  suggest  that  satisfactory  convergence  with  iterative  sampling  requires  only  a  small  fraction 
of  the  levels  of  random  variate  generation  reported  by  Tanner  and  Wong  (1987). 

The  non-iterative  Rubin  importance  sampling  algorithm.  Section  2.5,  requires  us  to  choose  a  sampling 
density,  [Z\Y],,  and  then  to  proceed  as  follows,  for  /  =  l,...,m: 

draw  Z,  from  [Z\Y],,  tj,  from  [tj\Z,Y],  9,  from  [9\ri,Z,Y],  with  the  latter  two  distributions  as  detailed 
above,  thus  creating  a  triple  (9hTU,Z,)\ 

calculate 


[Y,Z,\9h  T7,)«[g„T7,] 


form  estimates. 
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[sin 


mm 


I  [9\t,„Z „Y]r, 
1=1 

m 

I  r, 

/*  1 

I  m\e,.Z,.Y]r, 

1=1 _ 

m 

S  r, 

1=1 


Table  2  shows  the  average  cumulative  posterior  probability  estimates  from  this  approach  based  on 
2500  replicates  of  m  =  40  and  m  =  200.  taking  [Z\Y],  to  be  the  product  of  Xj-Bi^.J)  and 
X5-Bi  (y4.i). 


Table  2 


Estimates  from  the  Rubin  importance  sampling  algorithm 


estimates :  m 

=  40  (200) 

cdf  value 

Q 

n 

.05 

.105(.150) 

.0490049) 

.25 

.3 1 1(.35 1 ) 

.244(.241) 

.50 

.521(.537) 

.4850477) 

.75 

.7390734) 

.7290714) 

.95 

.9390932) 

.9340921) 

Despite  the  much  larger  number  of  drawings  compared  with  the  iterative  samplers,  the  estimation  is  rather 
poor.  In  general,  experience  suggests  that  the  algorithm  is  highly  sensitive  to  the  choice  of  [Z|  Y],  and  that 
the  larger  one-off  simulation  is  no  match  for  iterative  adaptation  via  small  simulations. 


4.2  A  conjugate  hierarchical  model 


We  shall  apply  the  exchangeable  Poisson  model,  discussed  in  Section  3.2,  to  data  on  pump  failures, 
previously  analysed  by  Gaverand  O’Muircheartaigh  (1987),  and  reproduced  here  in  Table  3,  where  s,  is  the 
number  of  failures  and  t,  is  the  length  of  time  in  thousands  of  hours. 


Table  3 


Pump  Failure  data 


Pump  system 

Si 

‘i 

p.TxlO2) 

1 

5 

94.320 

5.3 

2 

1 

15.720 

6.4 

3 

5 

62.880 

8.0 

4 

14 

125.760 

11.1 

5 

3 

5.240 

57.3 

6 

19 

31.440 

60.4 

7 

1 

1.048 

95.4 

8 

l 

1.048 

95.4 

9 

4 

2.096 

191.0 

10 

22 

10.480 

209.9 

Recalling  the  model  structure  of  Section  3.2  and  the  forms  of  conditional  distribution  given  by  (181 
and  (19),  we  shall  illustrate  the  use  of  the  Gibbs  sampler  for  this  data  set,  with  p  =  10,  8  =  1,  y=  0.1  and. 
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for  the  purposes  of  illustration. 


the  latter  derived  by  a  method-of -moments  empirical  Bayes  argument  based  on 

E(Pi)  =  ££(p,|A,)  -  2  -  * 

P 

V'fo)  =  VE[pl\X)  +  EV(p,\Xl)  =  ~s  +  -  S2  =  p-'Xfp.-p)2. 

P  P‘i 

The  cycle  is  defined  as  follows: 

draw  initial  /?<0)  from  [ft] ,  where  ft  -  /G(y,S); 

draw  independent  Aj1’  from  [A;|K,p(0),Ay,y  *  i]  =  [Ay  |y,/i(0)],  which  is  a  C(a  +  r;,(t;  +  l/p(0>)_1)  distribu¬ 
tion,;  =  1 

draw/J(1)  from  [y3|T,  A^\...,Ap  J],  which  is  an  /Cfy+arp.P+XA}15)  distribution; 

reinitialize  the  cycle  with  p(1)  and  iterate,  replicating  each  cycle  m  times. 

Figure  1  shows  a  selection  of  four  marginal  posterior  densities  (for  A2.A4.Ag, A,)  calculated  from  (20) 
following  a  run  of  10  cycles  of  the  algorithm.  In  fact,  three  densities  are  superposed;  one  corresponds  to 
m  =  10;  one  to  m  =  100,  and  the  third  is  the  ‘exact’  density  calculated  using  techniques  described  by  Smith 

et  al  (1985,  1987).  Even  in  the  cases  of  Ag  and  A*  (chosen  as  'worst  cases’  from  A, . A10),  the  densities 

are  hardly  distinguishable— a  rather  remarkable  convergence  from  such  a  small  number  of  drawings. 

Figure  1 


S.  Discussion 

The  emphasis  in  this  paper  has  been  on  providing  a  comparative  review  and  explication  of  three  pos¬ 
sible  sampling  approaches  to  the  calculation  of  intractable  marginal  densities.  The  substitution,  Gibbs  and 
importance  sampling  algorithms  all  share  the  characteristic  of  being  straightforward  to  implement  in  a 
number  of  frequently  occurring  practical  situations,  thus  avoiding  complicated  numerical  or  analytic 
approximation  exercises  (often  necessitating  intricate  attention  to  reparametrisation  and  other  subtleties 
requiring  case  by  case  consideration).  For  this  latter  reason,  if  for  no  other,  the  techniques  deserve  to  be 
better  known  and  experimented  with  for  a  wide  range  of  problems.  We  hope  that  the  unified  exposition 
attempted  here  will  provide  a  general,  clarifying  perspective  within  which  to  view  the  work  of  Geman  and 
Geman  (1984),  Rubin  (1987,  1988),  and  Tanner  and  Wong  (1987),  and  to  evaluate  its  potential  for  other 
structured  problems.  For  example,  in  addition  to  the  model  structures  given  in  Section  4,  the  methods  find 
immediate  and  powerful  application  to  problems  involving  ordered  random  variables,  or  involving  change- 
points.  We  shall  provide  detailed  and  extensive  numerical  illustration  of  a  number  of  such  problems  in  a 
subsequent  applications  paper. 

The  preliminary  computational  experience  reported  here  serves  to  illustrate  the  following  points; 

iterative,  adaptive  sampling  (substitution  or  Gibbs)  invariably  provides  better  value,  in  terms  of  efficient  use 
of  generated  variates,  than  an  equivalent  sample  size,  non-iterative,  one-off  approach  (Rubin),  provided  a 
suitable  structure  for  iterative  sampling  exists; 

in  problems  where  certain  reduced  conditionals  are  available,  there  is  scope  for  accelerating  the  substitution 
algorithm  so  that  it  becomes  more  efficient  (particularly  in  early  cycles)  than  the  Gib^"  algorithm;  however. 
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the  gain  in  efficiency  is  only  likely  to  show  markedly  when  the  number  of  reduced  conditionals  is  a  rela¬ 
tively  large  fraction  of  the  total  number  of  conditionals  involved  in  a  cycle; 

there  are  important  practical  problems  in  tuning  monitoring  and  stopping  rule  procedures  for  iterative  sam¬ 
pling  in  large-scale  complex  problems;  we  shall  report  on  these  in  the  applications  paper  referred  to  earlier. 

Finally,  we  note  that  even  in  cases  where  ultimate  convergence  of  the  iterative  sampling  procedures 
proves  slow,  moment  or  other  information  provided  by  a  few  initial  cycles  can  be  used  to  provide  highly 
effective  starting  values  for  more  sophisticated  numerical  or  analytic  approximation  techniques. 
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